I.  INTRODUCTION


In this paper two dual algorithms for the efficient concurrent
computation of partial sums are introduced and their performance
assessed. These algorithms are members of a novel class of
algorithms and architectures that is particularly suited for
arithmetic intensive, high throughput computing. This class is
based on partitioning the desired computations into parts that
can assume a relatively small number of distinct forms. The
redundancy resulting from the appearance of a given form more
than once is removed by computing each form only once. The
computation of all the distinct forms is performed first, and the
results are then combined appropriately to obtain the desired
results. The partitioning size is parameterized by a partition
parameter, or parameters. A cost function is defined to take into
account all the relevant factors such as the number of
operations, chip area, and computation time. This cost function
is, after partitioning, a function of the partition parameter(s)
with respect to which it can be minimized. The minimum cost is
attained at certain optimal partition sizes, which should be used
in the implementation.

The partial sums computations of interest to us in this paper
are expressed in the form

    Y = BX + U    (1.1)

where U, X, B, and Y are of dimensions M x 1, N x 1, M x N, and
M x 1, respectively. U and X are data vectors, and B is a binary
matrix of zeros and ones that defines the desired partial sums Y.
The vector X contains all the data from which subsets for the
various partial sums are selected. The vector U is added only to
complete the duality of the two algorithms, and has an
insignificant effect on their behavior.

The first algorithm to compute these partial sums is based
on partitioning the output, Y. In section II the preliminary
version of this algorithm presented in [1, 2] is completed. A
dual algorithm based on partitioning X, the input, is presented
in section III where the complete duality of the two algorithms
is established. In assessing the performance of the algorithms,
some parameters are treated initially as continuous, while only
integer values can be used in the implementation. This is shown
in section IV to have an insignificant effect on the performance
of the algorithms. In section V various aspects of implementing
the algorithms are considered, with particular emphasis on
parallel and pipelined architectures for high throughput
applications.

Let the number of operations in direct computation of Y be
denoted by D. The proposed algorithms are shown to result in
replacing D by O(D/\log D). The proportionality factor
associated with this "order of" estimate is in the range 1 to 4
for various instances of the algorithms, as well as for their
combination as shown in section VI. Generalizations of this work
are also suggested in section VI, while suggestions for
applications and topics for further research conclude the paper
in section VII.


II.  ALGORITHM  BASED  ON  OUTPUT  PARTITIONING 


In this section we discuss an algorithm that computes a set
of concurrent partial sums by partitioning them optimally into a
number of subsets. Each subset is computed independently,
applying the concept of redundancy removal, which is best
introduced first through the following characteristic case,
essential to the algorithm.

A.  First  Characteristic  Case 

Consider the computation of the partial sums of (1.1)

    Y = BX + U    (2.1)

for B with dimensions r x n, where

    n = 2^r - 1    (2.2)

In addition, all the columns of B are distinct and none consists
entirely of zero entries. This implies that the entries of each
column of B correspond to the binary representation of one of the
integers {1, 2, ..., 2^r - 1}. Also, each row of B contains
exactly 2^{r-1} ones and 2^{r-1} - 1 zeros.

The algorithm is comprised of two steps that are applied
alternately until all partial sums are computed:

Step 1

Compute one of the partial sums. This requires A_{11} = 2^{r-1}
additions and eliminates one row from B, and the corresponding
entry of U. There are now two identical columns of B
corresponding to each of the binary representations of the
numbers {1, 2, ..., 2^{r-1} - 1} and one column with all zero
entries.

Step 2

Remove the zero column from B and the corresponding entry from X.
Remove one of each two identical columns of B after adding the
corresponding entries of X. This requires A_{12} = 2^{r-1} - 1
additions.

The two steps comprise one iteration of the algorithm. The
first execution of the algorithm requires
A_1 = A_{11} + A_{12} = 2^r - 1 additions and replaces r by
r - 1. The ith execution requires A_{i1} = 2^{r-i} and
A_{i2} = 2^{r-i} - 1 additions for steps 1 and 2 respectively,
resulting in

    A_i = A_{i1} + A_{i2} = 2^{r-i+1} - 1,  i = 1, 2, ..., r    (2.3)


From eq. (2.2, 2.3) the total number of additions to compute all
partial sums is found to be

    A(n) = \sum_{i=1}^{r} A_i = 2(2^r - 1) - r = 2n - r
         = 2n - \log(n+1) = O(n) \approx 2n    (2.4)
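The count in (2.4) can be checked with a small software model of the two steps. The following is a minimal sketch, not the authors' implementation: the function name `characteristic_case` and the use of a dictionary keyed by the integer value of each column of B are assumptions made for illustration. Row i of B is taken to be bit i of the column index.

```python
def characteristic_case(r, X, U):
    """Model of steps 1 and 2 for the first characteristic case.

    Assumption: column j (1-based) of B is the binary form of j,
    with row i of B given by bit i of j. Returns (Y, additions).
    """
    n = 2**r - 1
    assert len(X) == n and len(U) == r
    # buckets[tag] holds the (possibly merged) data whose column of B equals tag
    buckets = {j + 1: X[j] for j in range(n)}
    Y, adds = [], 0
    for i in range(r):
        # Step 1: one output = U[i] plus every bucket whose lowest bit is 1
        acc = U[i]
        for tag, value in buckets.items():
            if tag & 1:
                acc += value
                adds += 1
        Y.append(acc)
        # Step 2: drop the used row (shift tags), discard the zero column,
        # and merge data whose columns have become identical
        merged = {}
        for tag, value in buckets.items():
            t = tag >> 1
            if t == 0:
                continue
            if t in merged:
                merged[t] += value
                adds += 1
            else:
                merged[t] = value
        buckets = merged
    return Y, adds
```

For r = 4 the model performs 15 + 7 + 3 + 1 = 26 additions, which agrees with 2n - r for n = 15.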


log is to the base 2 throughout this paper. The number of
additions per output is then

    C(n) = (2(2^r - 1) - r)/r = 2(n/r) - 1
         = 2(n/\log(n+1)) - 1 = O(n/\log n) \approx 2n/\log n    (2.5)

Since each row of B contains 2^{r-1} ones in this
characteristic case, a direct computation of each partial sum
independently results in the following number of additions per
output

    D(n) = 2^{r-1} = (n+1)/2 = O(n) \approx n/2    (2.6)


The efficiency of our approach in comparison to direct
computation can be expressed by the ratio

    \eta(n) = C(n)/D(n) = 4(n - (\log(n+1))/2) / ((n+1) \log(n+1))
            = O(1/\log n) \approx 4/\log n    (2.7)

or equivalently by the expressions

    \eta(n) = O(1/\log D(n)),
    C(n) = O(D(n)/\log D(n))    (2.8)

which  indicate  the  type  of  computational  savings  achieved  by  the 
proposed  algorithm. 
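To give a feel for the size of these savings, the expressions for C(n), D(n) and their ratio in the first characteristic case can be tabulated for a few values of r. This is only an illustrative sketch; the helper name `efficiency` is hypothetical.

```python
import math


def efficiency(r):
    """Return (n, C, D, eta) for the first characteristic case with n = 2^r - 1."""
    n = 2**r - 1
    C = (2 * n - r) / r   # additions per output with redundancy removal
    D = (n + 1) / 2       # additions per output by direct computation
    return n, C, D, C / D


for r in (4, 8, 12, 16):
    n, C, D, eta = efficiency(r)
    # The last column is the asymptotic estimate 4/log n for comparison
    print(f"r={r:2d}  n={n:6d}  C={C:10.2f}  D={D:8.1f}  "
          f"eta={eta:.4f}  4/log2(n)={4 / math.log2(n):.4f}")
```

The ratio approaches the estimate 4/log n slowly from below as r grows.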

An important aspect of this algorithm is its invariance with
respect to the partitioning within the r outputs in steps 1 and 2
above. So, instead of computing one output at a time, followed by
merging inputs corresponding to redundant computation for the
remaining outputs, we could partition the r outputs into two sets
of r_1 and r_2 outputs each. Due to the assumption of distinct
columns of B, and the carefully chosen n according to (2.2), the
computations involved in the r_1 and r_2 outputs are mutually
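exclusive and could be computed independently. Let
n_1 = 2^{r_1} - 1 and n_2 = 2^{r_2} - 1, and let B_1 denote the
r_1 rows of B that belong to the first set. The n_2 columns of
B_1 that are all zero contribute nothing, and the remaining
n - n_2 columns take only n_1 distinct values. The r_1 outputs
therefore require 2^r - 1 - (2^{r_1} - 1) - (2^{r_2} - 1)
additions to add inputs corresponding to identical columns of
B_1, followed by A(n_1) additions as determined by the above
characteristic case, but with dimensions n_1 and r_1. A similar
argument for the remaining r_2 outputs leads to

    A(n) = A(n_1) + 2^r - 1 - (2^{r_1} - 1) - (2^{r_2} - 1)
         + A(n_2) + 2^r - 1 - (2^{r_2} - 1) - (2^{r_1} - 1)

which, with r = r_1 + r_2, leads to

    A(n) = 2n - r

identically to (2.4). For r a power of 2, a scheme of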
progressively finer partitioning, and combining data with
identical roles in the required computation, is possible here. At
every step, the number of parallel computations doubles until all
outputs appear simultaneously at the last step. In the fastest
implementation, using an adder tree to be discussed in the sequel,
and regardless of the partitioning method, if all data
additions are performed by adder trees, then all the outputs will
be computed after a time period corresponding to r - 1 additions.

The dimensions of this characteristic case were carefully
chosen. Therefore, only the redundancy removal aspect of the
algorithm was applied. In the sequel we consider problems with
arbitrary dimensions and show how redundancy removal is combined
with optimal partitioning to result in a complete application of
the algorithm.

B.  The  General  Case 

To apply the above approach to the general case with
arbitrary dimensions (1.1), we partition the M partial sums Y,
and similarly U, into s sets Y_i, i = 1, 2, ..., s, of r partial
sums each, and denote by B_i the r rows of B that correspond to
Y_i. The parameters r and s are related by

    M = rs    (2.9)

Furthermore, we assume that r satisfies

    N \ge 2^r - 1    (2.10)

For a worst case analysis, we assume that all the 2^r - 1
distinct nonzero binary vectors are present in each B_i. For each
group of r partial sums we begin by executing step 2 as
introduced above. This requires N - (2^r - 1) additions per B_i.
The problem is now identical to the characteristic case discussed
above, and therefore 2(2^r - 1) - r additions per r partial sums
are needed to complete the computation. So the number of
additions per r partial sums is, in this worst case analysis

    A(N, r) = N + 2^r - r - 1    (2.11)

Equivalently, the number of additions per output is

    C(N, r) = (N + 2^r - r - 1)/r
            = (N + 2^r - 1)/r - 1    (2.12)

There are at least two approaches to investigate the optimal
values of r at which C(N, r) attains its minimum. In the first
approach the minimum is found by treating r as a continuous
variable. The derivative of C with respect to r vanishes at

    N = 2^r (r \ln 2 - 1) + 1    (2.13)

at which C attains its minimum value. There is no explicit closed
form expression of r as a function of N that could be obtained
from eq. (2.13). However, such an expression is not essential in
applying the algorithm where only integer values of r are of
interest. The first integer value of r for which (2.10, 2.13) are
simultaneously valid is 3. A table of the values of N
corresponding to r = 3, 4, ... could be formed to cover the range
of values of N of interest. The initial estimate

    r_1 = \log(N-1)    (2.14)

could be used either to identify the range of values of r that
are then used to generate a small table via (2.13), or as the
initial term in the iteration

    r_{i+1} = \log(N-1) - \log(r_i \ln 2 - 1),  i = 1, 2, ...    (2.15)

which converges rapidly to the solution of (2.13). With r_1 as
defined in (2.14), the next two iterations result in

    r_2 = \log(N-1) - \log(\ln(N-1) - 1),
    r_3 = \log(N-1) - \log(\ln(N-1) - \ln(\ln(N-1) - 1) - 1)    (2.16)
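The iteration (2.15) is easy to try numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the iteration count is an arbitrary choice, and N is assumed large enough (roughly N > 10) for the fixed-point map to be a contraction near the solution.

```python
import math

LN2 = math.log(2)


def optimal_r_continuous(N, iterations=50):
    """Iterate r <- log2(N-1) - log2(r ln 2 - 1), starting from r_1 = log2(N-1)."""
    L = math.log2(N - 1)
    r = L                                   # initial estimate (2.14)
    for _ in range(iterations):
        r = L - math.log2(r * LN2 - 1)      # one step of (2.15)
    return r


def stationarity_residual(N, r):
    """Left side minus right side of N = 2^r (r ln 2 - 1) + 1, eq. (2.13)."""
    return 2**r * (r * LN2 - 1) + 1 - N


if __name__ == "__main__":
    for N in (64, 1000, 10**6):
        r = optimal_r_continuous(N)
        print(N, round(r, 4), stationarity_residual(N, r))
```

In practice only the neighboring integer values of the returned r would be compared, as the text notes.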

From (2.12, 2.13) we obtain the minimum number of additions per
output corresponding to the optimal value of r

    C = ((N-1) \ln 2 / (r \ln 2 - 1)) - 1    (2.17)

which with (2.14, 2.16) results in

    C \approx ((N - 1)/(\log(N-1) - 1)) - 1 \approx N/\log N    (2.18)

Direct computation requires an average of

    D = N/2    (2.19)

additions per partial sum. The efficiency of the proposed
algorithm is characterized by

    \eta = C/D \approx 2/\log N    (2.20)

which is a conservative estimate of its performance, since we are
comparing our worst case to the average direct computation. In
this algorithm, all the blocks of r partial sums are computed
independently. Redundancy is removed only within each block.
Further redundancy could be removed based on computation shared
between blocks. This additional redundancy is insignificant and
its removal would require complicated communication schemes. A
second approach to investigating the optimal values of r is
presented in section IV, where r and N are assumed to be
integers.
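The complete output partitioning procedure described in this section can be modeled in software as follows. This is a minimal sketch rather than the authors' implementation: the function name `output_partition`, the list-of-lists representation of B, and the dictionary of merged data keyed by column pattern are all assumptions made for illustration. Each block of r rows first merges inputs with identical columns, then alternates steps 1 and 2.

```python
def output_partition(B, X, U, r):
    """Compute Y = BX + U block by block, r outputs at a time.

    B is a list of M rows of 0/1 entries, X has N entries, U has M entries.
    Returns (Y, additions). A final block may hold fewer than r rows.
    """
    M, N = len(B), len(X)
    Y, adds = [], 0
    for start in range(0, M, r):
        block = B[start:start + r]
        k = len(block)
        # Initial merge: add inputs whose columns in this block are identical
        buckets = {}
        for j in range(N):
            tag = sum(block[i][j] << i for i in range(k))
            if tag == 0:
                continue                    # input not used by this block
            if tag in buckets:
                buckets[tag] += X[j]
                adds += 1
            else:
                buckets[tag] = X[j]
        for i in range(k):
            # Step 1: one output, including its entry of U
            acc = U[start + i]
            for tag, value in buckets.items():
                if tag & 1:
                    acc += value
                    adds += 1
            Y.append(acc)
            # Step 2: remove the used row and merge identical columns
            merged = {}
            for tag, value in buckets.items():
                t = tag >> 1
                if t == 0:
                    continue
                if t in merged:
                    merged[t] += value
                    adds += 1
                else:
                    merged[t] = value
            buckets = merged
    return Y, adds
```

In this model the number of additions per block never exceeds the worst case count N + 2^r - r - 1 of (2.11), and it is smaller when some column patterns are absent from a block.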




III.  A  DUAL  ALGORITHM


In this section we introduce an algorithm based on input
partitioning. This algorithm is a dual to the one discussed above
with the input and the output roles exchanged in all significant
expressions and statements. There are, however, differences in
implementing the two algorithms. There is also a subtle
difference in generalizing the two algorithms to operations other
than addition. These differences will be addressed in the
following sections.

The following characteristic case is essential since it
illustrates the redundancy removal aspect of this dual algorithm.

A.  Second  Characteristic  Case 

Consider the computation of the partial sums of (1.1)

    Y = BX + U    (3.1)

but for B with dimensions n x r, where

    n = 2^r - 1    (3.2)

In addition, all the rows of B are distinct and none consists
entirely of zero entries. This implies that the entries of each
row of B correspond to the binary representation of one of the
integers {1, 2, ..., 2^r - 1}. Also, each column of B contains
exactly 2^{r-1} ones and 2^{r-1} - 1 zeros. B is, in this case,
the transpose of that in the first characteristic case. Let us
first ignore U. Computing Y in this case amounts to computing all
the partial sums of the entries of X.

Let P(r) be the number of additions required to compute all
the 2^r - 1 nonzero partial sums of r elements of a set. If one
more element is added to the set, then the best that could be
done is to add the new element to each of the existing partial
sums. This requires 2^r - 1 additions. This is the smallest
number of additions needed to generate all the additional partial
sums that include the new element. The result is the following
recursion

    P(r+1) = P(r) + (2^r - 1)    (3.3)

which results in

    P(r) = 2^r - r - 1    (3.4)
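The recursion (3.3) translates directly into a short routine that builds every nonzero partial sum of a list while counting additions. This is an illustrative sketch only: the name `all_partial_sums` and the bit-mask indexing of subsets are assumptions, not notation from the paper.

```python
def all_partial_sums(x):
    """Return ({mask: sum of x[k] for bits k set in mask}, additions).

    Each new element is added once to every partial sum already formed,
    which is the recursion P(r+1) = P(r) + (2^r - 1).
    """
    sums = {}
    adds = 0
    for k, xk in enumerate(x):
        new = {1 << k: xk}                  # the element by itself costs nothing
        for mask, s in sums.items():
            new[mask | (1 << k)] = s + xk   # one addition per existing sum
            adds += 1
        sums.update(new)
    return sums, adds
```

For r = 3 the routine uses 0 + 1 + 3 = 4 additions, matching P(3) = 2^3 - 3 - 1.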

From eq. (3.2, 3.4) the total number of additions to compute all
partial sums is found to be, after including n additions for U

    A(n) = 2(2^r - 1) - r = 2n - r
         = 2n - \log(n+1) = O(n) \approx 2n    (3.5)

which is identical to (2.4), but with r and n indicating the
number of columns and rows of B respectively in this dual case.
The above result could also be obtained if the r inputs are
partitioned into two subsets of r_1 and r_2 inputs each,
respectively. All the partial sums of each subset are computed,
and the results combined to obtain Y. Regardless of the method of
partitioning, the fastest implementation would result in all the
outputs after a time corresponding to r - 1 additions, with some
components of Y computed even sooner. The number of additions per
input is then

    C(n) = (2(2^r - 1) - r)/r = 2(n/r) - 1
         = 2(n/\log(n+1)) - 1 \approx 2n/\log n    (3.6)

Direct computation of the partial sums requires a number of
additions that is equal to the number of ones in B. This is easy
to see, since each row of B corresponds to a number of additions
of entries of X that is one less than the number of ones in that
row. An extra addition is required to include the corresponding
element of U, thus resulting in one element of Y. Since B in this
characteristic case is the transpose of that of the first
characteristic case, it has the same total number of ones of
r 2^{r-1}. The number of additions per input is, therefore

    D(n) = 2^{r-1} = (n+1)/2 \approx n/2    (3.7)

which is identical to (2.6), but with the roles of the rows and
columns interchanged.

B.  The  General  Case 

To apply the above approach to the general case with
arbitrary dimensions (1.1), we need to combine the redundancy
removal aspect of the algorithm as introduced above with optimal
partitioning. Let the N inputs of X be partitioned into s sets
X_i, i = 1, 2, ..., s, of r elements each. The parameters r and s
are related by

    N = rs    (3.8)

Furthermore we assume that r satisfies

    M \ge 2^r - 1    (3.9)

To compute all the partial sums of the entries of each of the s
sets X_i, a total of (2^r - r - 1)N/r additions are needed.
This follows from (3.4) and (3.8). Each partial sum,
corresponding to one entry of Y, could then be obtained at the
cost of N/r extra additions. This includes the proper entry of U.
The total of the extra additions to obtain Y is MN/r. The total
number of additions to compute all partial sums is then

    A(M, N, r) = (2^r - r - 1)N/r + MN/r    (3.10)

and the number of additions per r inputs is

    A(M, r) = M + 2^r - r - 1    (3.11)

which is the dual of (2.11) with M replacing N. Equivalently, the
number of additions per input is

    C(M, r) = (M + 2^r - r - 1)/r
            = (M + 2^r - 1)/r - 1    (3.12)

which is also the dual of (2.12). The above establishes the
complete duality of the two algorithms. The remainder of our
analysis is identical, via duality and proper exchange of
variables, to that of section II. Comments on redundancy between
blocks are similar to those made at the end of section II, but
with the roles of N and M interchanged.
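A software model of the input partitioning algorithm makes the count in (3.10) concrete. The sketch below is an assumption-laden illustration, not the authors' implementation: the name `input_partition`, the per-group lookup tables indexed by bit mask, and the choice to skip an addition when a row selects nothing from a group are all hypothetical details.

```python
def input_partition(B, X, U, r):
    """Compute Y = BX + U by tabulating all partial sums of each input group.

    B is a list of M rows of 0/1 entries, X has N entries, U has M entries.
    Returns (Y, additions). A final group may hold fewer than r inputs.
    """
    M, N = len(B), len(X)
    adds = 0
    tables = []
    for start in range(0, N, r):
        group = X[start:start + r]
        k = len(group)
        table = [0] * (1 << k)              # table[0] is the empty sum
        for b, xb in enumerate(group):
            bit = 1 << b
            table[bit] = xb
            for mask in range(1, bit):      # every sum formed so far
                table[bit | mask] = table[mask] + xb
                adds += 1
        tables.append((start, k, table))
    Y = []
    for m in range(M):
        acc = U[m]
        for start, k, table in tables:
            tag = sum(B[m][start + i] << i for i in range(k))
            if tag:                         # one addition per group used
                acc += table[tag]
                adds += 1
        Y.append(acc)
    return Y, adds
```

When r divides N, the additions in this model are bounded by (2^r - r - 1)N/r + MN/r, the total given in (3.10).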

IV.  EFFECT  OF  INTEGER  PARAMETERS

There are several effects of restricting the parameters N,
M, and r to integer values. We will be concerned mainly with the
first algorithm of output partitioning, since duality extends the
results immediately to the second algorithm. Let us investigate
first the effect of restricting r and N to be integers. Equation
(2.12) could be rewritten in the form

    C(N, r) = N/r + (2^r - r - 1)/r    (4.1)

This could be viewed as a straight line function of N with a
slope of 1/r and a displacement of (2^r - r - 1)/r, both of which
are parameterized by r. Let us generate these straight lines for
r = 1, 2, .... The lower envelope (pointwise minimum) of this
collection of straight lines is a piecewise linear curve, each
segment of which is a part of one of the above straight lines
that corresponds to a particular integer value of r as depicted
in Fig. 1. This piecewise linear curve represents the least
possible number of additions per output C as a function of just
N, since r was used as a parameter in generating this envelope.
The vertices of the resulting piecewise linear curve are the
points at which C(N, r) = C(N, r+1), and occur at N_r where

    N_r = 2^r (r - 1) + 1    (4.2)

The range of values of N for which a given value of r should be
used is N \in [N_{r-1}, N_r]. This range, followed next by the
related difference and ratio

    2^{r-1}(r-2) + 1 \le N \le 2^r (r-1) + 1,
    N_r - N_{r-1} = r 2^{r-1},
    N_r/N_{r-1} = (2^r (r-1) + 1)/(2^{r-1}(r-2) + 1) \approx 2    (4.3)

should be used in assessing the value, or range of values of r
that correspond to the range of values of N in a given
application or problem.
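The selection of an integer r from the breakpoints N_r can be sketched in a few lines. The helper names `cost`, `breakpoint_N` and `optimal_r` are hypothetical; the routine simply returns the smallest r with N no larger than N_r and can be compared against a brute-force search over r.

```python
def cost(N, r):
    """Worst case additions per output, C(N, r) = (N + 2^r - r - 1)/r."""
    return (N + 2**r - r - 1) / r


def breakpoint_N(r):
    """Vertex N_r = 2^r (r - 1) + 1 at which C(N, r) = C(N, r + 1)."""
    return 2**r * (r - 1) + 1


def optimal_r(N):
    """Smallest integer r >= 1 with N <= N_r, i.e. N in [N_{r-1}, N_r]."""
    r = 1
    while N > breakpoint_N(r):
        r += 1
    return r


if __name__ == "__main__":
    for N in (5, 6, 17, 18, 1000):
        r = optimal_r(N)
        print(N, r, cost(N, r))
```

At a breakpoint such as N = 5 both neighboring values of r give the same cost, so either choice is acceptable there.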

One way of relating the optimization of r as a continuous
variable to that of integer r is as follows. Every segment of
the piecewise linear curve defined above is a tangent to the
curve of minimal C as a function of N that results from treating
r as a continuous variable. The two curves are farthest apart at
N_r, r = 1, 2, .... At these points we get from (4.1, 4.2)

    C = 2^r - 1 = (N_r - 1)/(r-1) - 1
      = O(N/\log N) \approx N/\log N    (4.4)

which exhibits the same asymptotic behavior as for continuous r.
This indicates the insensitivity of the optimal performance to
small changes in r.
The effect of restricting N to integer values is simply
incorporated by considering only those points on the above
piecewise linear curve that correspond to integer N.

The last issue of importance here concerns M and s. It is obvious
that the optimization makes sense only if M \ge r in the first
algorithm, and N \ge r for the dual one. Other than that, if s is
not an integer, then all the partitioned parts of the problem
will not be of the same size. In this case we simply use one of
the nearest integer values to s, and only minor deviation from
the optimal behavior should result. We can limit the difference
to only one of the partitioned systems, or try to make them all
as close as possible with a difference of one row (column in the
dual case) at the most between any pair.

V.  IMPLEMENTATION  CONSIDERATIONS 

In this section architectures that implement the above
algorithms are introduced. Only the basic concepts of each
implementation are considered, since details are better left for
individual applications.

A.  The  Output  Partitioning  Algorithm 

Since every group of r partial sums is evaluated
independently, it suffices to consider the implementation of one
such group. A parallel architecture is then obtained by
replicating this implementation s times. Next, two types of
implementation are discussed.

If the data is obtained sequentially, then a pipelined
architecture is particularly suitable. We examine here an
architecture that implements the algorithm for only one group of
r partial sums, which is then replicated in parallel or used
sequentially for the complete implementation. This architecture
is based on a RAM and an adder. Each data item is associated with
the corresponding column of B_i, which is used as an address tag.
The contents of the memory at this address are read, added to the
data and restored in the same location. It is clear that this
simple architecture adds data corresponding to identical columns
of B_i. After all the data is obtained, we begin computing the
partial sums by reading all the data from locations with 1 as the
last bit of their address and accumulating it, using the adder.
The data is then read, the last bit of its address tag removed,
and then applied to a similar memory-adder architecture but with
half the previous memory size. Clearly, the above implements the
two steps of the algorithm. The above approach could be modified
in a number of ways to adapt it to a particular environment, such
as a microprocessor implementation or a particular bus
architecture.

Our purpose here is to present only the basic approaches,
but since high throughput applications are of particular
interest, the following pipelined architecture is of particular
importance.

The data is passed only once through the above memory-adder. The
data read from the memory-adder is obtained in the desired order
of an increasing address tag. Since the data is now well ordered,
and with distinct address tags, the partial sums could be
computed in an adder tree that is structured to implement the
steps of the algorithm as depicted in Fig. 2. This adder tree can
serve several memory-adders interfaced to it via serial in /
parallel out shift registers.

B.  The  Input  Partitioning  Algorithm

This algorithm is naturally suited for parallel
implementation. All the partial sums of each group of r inputs
are computed independently, and each output is obtained by
collecting one of the partial sums from each input group and one
of the entries of U. We will consider the computation of all the
partial sums of only one of the groups of r inputs, since this
could be used as the building block of various implementations.
An elegant structured adder tree architecture that computes all
the partial sums of r numbers is based on the recursion of
eq. (3.3). As illustrated in Fig. 3, the tree architecture
implements the recursion directly. The three partial sums of the
first two inputs are each added to the third input, to result in
the desired seven partial sums. This "nesting" could be repeated
as many times as needed. A parallel architecture based on copies
of such a tree is obvious. A pipelined architecture is also
possible, where the sets of r inputs are applied sequentially to
one tree. Of course all the r inputs of each set are applied in
parallel to the tree.


VI.  COMPARISONS  AND  GENERALIZATIONS 


In this section, we consider the performance of the above
two algorithms when they are both available simultaneously. Also,
generalizations of the algorithms and their cost functions are
discussed.

A.  Performance  of  The  Combined  Algorithms 

The total number of additions to compute Y of (1.1) via the
output partitioning algorithm could be readily shown from the
analysis in section II to satisfy

    A_t = O(MN/\log N) \approx MN/\log N    (6.1)

while for input partitioning we obtain

    A_t = O(MN/\log M) \approx MN/\log M    (6.2)

The performance of the two algorithms, combined, is determined by

    A_t \approx MN/\max(\log N, \log M)    (6.3)

which results in

    A_t \le 2MN/\log NM    (6.4)

as a consequence of the obvious inequality

    MN/\max(\log N, \log M) \le 2MN/(\log N + \log M) = 2MN/\log NM

The total number of additions required for direct computation is,
on the average

    D_t = MN/2    (6.5)

The efficiency of the combined algorithms is then determined by

    A_t/D_t \le 4/\log NM    (6.6)

which is achieved by using the output partitioning algorithm if
N \ge M and its dual, the input partitioning one, if M > N. The
two algorithms are combined here only in the sense that they are
both available. Only one of them is used, however, in any given
instance as determined by the values, or range of values of M and
N. Of course, for the case of M = N the two algorithms are
equally effective, and only other implementation considerations
could favor either.

B.  Generalizations 

The  output  partitioning  algorithm  is  applicable  to 
concurrent  partial  computations  under  any  operation  that  is  both 
associative  and  commutative.  The  addition  is  only  one  such 
operation.  Commutativity  is  required  since  the  algorithm  might 
perform  computations  on  the  input  data  in  other  than  the  order  in 
which  it  was  given.  The  dual  algorithm  however  could  be  applied 
without  requiring  commutativity  and  is  therefore  applicable  in 
computing  expressions  in  any  associative  operation.  For  example 
it  could  be  used  in  computing  partial  products  of  a  set  of 
matrices  that  are  not  necessarily  commutative.  It  is  also 
possible  to  apply  these  algorithms  to  expressions  in  finite 
fields  and  rings  as  well  as  Boolean  algebra  ones. 
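The remark that the dual algorithm needs only associativity can be illustrated with a non-commutative operation. The sketch below is a hypothetical example, not taken from the paper: it reuses the recursion of (3.3) with string concatenation standing in for a non-commutative product such as matrix multiplication, and always places the new element on the right so that the original order of the inputs is preserved.

```python
def all_partial_products(items, op):
    """Return ({mask: ordered product of selected items}, operation count).

    op must be associative but need not be commutative; the selected items
    are always combined in their original left-to-right order.
    """
    products = {}
    ops = 0
    for k, item in enumerate(items):
        new = {1 << k: item}
        for mask, p in products.items():
            new[mask | (1 << k)] = op(p, item)   # new item goes on the right
            ops += 1
        products.update(new)
    return products, ops


if __name__ == "__main__":
    prods, ops = all_partial_products(["a", "b", "c"], lambda p, q: p + q)
    print(prods[0b101], prods[0b111], ops)
```

The operation count is again 2^r - r - 1, as in (3.4), since only the operation being applied has changed.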

The  cost  function  used  need  not  be  restricted  to  the  number 
of  operations.  For  example  in  VLSI  applications,  a  properly 
defined  cost  function  might  include  terms  to  reflect  area  and 
time  delay.  Also  several  partition  parameters  might  be  present  in 
the  cost  function. 


VII.  CONCLUSION 


Two novel algorithms for simultaneous computation of a large
number of partial sums are introduced and their performance
assessed. The D operations of direct computation are replaced by
O(D/\log D). The new approach is based on a new concept of
optimal partitioning and redundancy removal in arithmetic
intensive, high throughput computing that is expected to be the
basis of a new class of algorithms. This factor of O(1/\log D)
reduction in the required computation appears to be generic to
the new class of algorithms and is expected to appear in their
application to other problems such as vector matrix
multiplication and other linear algebra operations.

The new algorithms represent a departure from brute force
parallel computation where inherent redundancy is not detected or
removed. The resulting architectures admit of systolic
implementation in part, but also require some form of
broadcasting certain computations. The arguments for systolic
processing [3] are valid, but for sufficiently large problems,
the removal of redundancy via the new algorithms could result in
fundamental improvement. For example, a preliminary assessment of
an algorithm of the new class for a parallel vector multiplier
accumulator chip indicates that with VHSIC II implementation, at
least three times more multipliers could be accommodated in
comparison to direct implementation. The new approach is flexible
and could be optimized under various cost functions, such as chip
area, number of operations, time, time area product, etc. It
also could be applied on various levels: system, board, or chip.
Since broadcasting is particularly simple in optical computing,
the new algorithms could be of particular value. This could have
an impact on significant applications as in [4].

Application to FIR filters, utilizing their full structure,
should unify and improve the results in [5, 6]. Optimization of
various algorithms for large chips, combining the new algorithms
with techniques of the type in [7], is also one of the objectives
of this research. Since the approach is directly applicable to
finite fields and rings, it could also be considered for optimal
coding / decoding architectures and for computations with a
variety of arithmetic systems. The applicability to Boolean
expressions mentioned in section VI could result in a new
approach to PLA chip design.

Finally, we propose the following topics for further
research:

1- Developing algorithms and architectures of the proposed class
for vector-matrix multiplication, other linear algebra
operations, and for key signal processing algorithms and filters.

2- Optimizing the algorithms for software and hardware
implementation, including advanced technology chip sets of VHSIC
II and GaAs types as well as those utilizing optical components.

3- Assessing the performance of the new algorithms and
architectures in their various implementation modes for certain
supercomputer systems and for key fast signal processing
applications, such as digital beam forming, target identification
and other radar and communication applications.
Figure  Captions


Fig. 1. The minimal number of operations C as a function of N.
Here r is treated as an integer parameter in the
generation of this piecewise linear curve. The optimal
values of r are indicated in the corresponding ranges of N. The
behavior of the input partitioning algorithm is obtained via
duality by just replacing N by M above. In this architecture, the
computation of all the outputs is completed after a time
corresponding to r-1 additions, but some of the outputs are
computed sooner.


Fig. 2. A structured adder tree architecture for a pipelined
implementation of the output partitioning algorithm. Several
memory-adders are served with one such tree. The outputs from
the memory-adders are multiplexed to the tree via a bank of
serial in / parallel out shift registers. The depicted case is
for r = 4. The data at the rightmost position has an address tag
corresponding to 1, while the leftmost one corresponds to 15. In
such an architecture, all the outputs are available
simultaneously after a time corresponding to r-1 additions.


Fig. 3. A pipelined architecture for computing all the partial
sums of a set of numbers. The nested arrangement shown
illustrates how this architecture is based on the recursion of
(3.3). The depicted case corresponds to r = 3, where the three
partial sums of the first two inputs are added to the third input
to result in all the required seven partial sums.